The United States is the largest economy in the world. Based on realtrends article, real estate accounts for 16.9% of US's GDP on 2021 which is considered a major GDP contributor. The following analysis aims to provide viewers with real estate insights that may facilitate general understanding and potential investment decisions through market trends and factors influencing the real estate market.
# Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from IPython.display import HTML
HTML('''
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Toggle Code">
</form>
''')
HTML('''<style>.input {display:none;}</style>''')
# Importing data
raw_real_estate = pd.read_csv("realtor_data.csv")
This stage contains initial exploration of the real estate dataset. It involves identifying missing values, reformating certain columns for analysis purpose, filtering out duplicated datas and keep thedata required for analysis purpose.
# Initial Exploration
# Identifying the shape of the dataset
shape = raw_real_estate.shape
# identifying the data types of all the columns
data_types = raw_real_estate.dtypes
# Identifying the number of missing values in each the columns
missing_values = raw_real_estate.isna().sum().reset_index()
missing_values.columns = ['Column_Names', 'Missing_value_Count']
display(missing_values)
| Column_Names | Missing_value_Count | |
|---|---|---|
| 0 | status | 0 |
| 1 | price | 71 |
| 2 | bed | 131703 |
| 3 | bath | 115192 |
| 4 | acre_lot | 273623 |
| 5 | full_address | 0 |
| 6 | street | 2138 |
| 7 | city | 74 |
| 8 | state | 0 |
| 9 | zip_code | 205 |
| 10 | house_size | 297843 |
| 11 | sold_date | 466763 |
# Changing the date column into "date" data type
raw_real_estate['sold_date'] = pd.to_datetime(raw_real_estate['sold_date'], format = '%Y-%m-%d')
# Removing rows with missing values in certain columns
# raw_real_estate = raw_real_estate.dropna(subset = ['price', 'bed', 'bath', 'acre_lot', 'zip_code', 'house_size', 'sold_date'])
raw_real_estate = raw_real_estate.dropna(subset = ['bed', 'bath', 'house_size', 'sold_date'])
display(raw_real_estate.shape)
(310141, 12)
## Filtering for datas for years between year 2000 and 2022 (ordered)
# Finding the max and min dates in the dataset
raw_real_estate['sold_date'].agg(['min', 'max'])
# Filtering the date from the range of dates in order
raw_real_estate = raw_real_estate[(raw_real_estate['sold_date'] >= '2000-01-01')
& (raw_real_estate['sold_date'] <= '2022-12-31')].sort_values("sold_date")
# Filtering the columns needed for further analysis (dropping the below 2 columns)
final_real_estate = raw_real_estate.drop(columns = ["street", "acre_lot"])
# Removing Duplicate Values
final_real_estate = final_real_estate.drop_duplicates(subset = ["full_address", "sold_date", "zip_code"])
# Reformating the price column
final_real_estate['price'] = final_real_estate['price']/1000
# Renaming the price column
final_real_estate.rename(columns = {"price": "Sale_Price_Thousands"}, inplace = True)
# Create the column called "year"
final_real_estate['year'] = pd.DatetimeIndex(final_real_estate['sold_date']).year
# Adding a new column called Price per Squre Foot (Value for Money)
final_real_estate['price_per_sqft'] = (final_real_estate['Sale_Price_Thousands']/final_real_estate['house_size']) * 1000
# Adding a new column called room ratio (feeling of balance of the house)
final_real_estate['room_ratio'] = final_real_estate['bed']/final_real_estate['bath']
display(final_real_estate.shape)
(33941, 13)
# Final Check for Missing Values
missing_values_new = final_real_estate.isna().sum().reset_index()
missing_values_new.columns = ['Column_Names', 'Missing_value_Count']
| Column_Names | Missing_value_Count | |
|---|---|---|
| 0 | status | 0 |
| 1 | Sale_Price_Thousands | 0 |
| 2 | bed | 0 |
| 3 | bath | 0 |
| 4 | full_address | 0 |
| 5 | city | 0 |
| 6 | state | 0 |
| 7 | zip_code | 0 |
| 8 | house_size | 0 |
| 9 | sold_date | 0 |
| 10 | year | 0 |
| 11 | price_per_sqft | 0 |
| 12 | room_ratio | 0 |
# Downloading the new cleaned dataset
final_real_estate.to_csv("final_real_estate.csv", index = False)
display(final_real_estate.head(10))
| status | Sale_Price_Thousands | bed | bath | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 439526 | for_sale | 105.0 | 2.0 | 2.0 | 155 Bull Hill Ln Apt 309, West Haven, CT, 06516 | West Haven | Connecticut | 6516.0 | 989.0 | 2000-01-03 | 2000 | 106.167846 | 1.000000 |
| 920877 | for_sale | 225.0 | 3.0 | 2.0 | 21 Morris Dr, Newburgh, NY, 12550 | Newburgh | New York | 12550.0 | 1245.0 | 2000-01-03 | 2000 | 180.722892 | 1.500000 |
| 778538 | for_sale | 695.0 | 6.0 | 2.0 | 1324 Star Ave, Elmont, NY, 11003 | Elmont | New York | 11003.0 | 1770.0 | 2000-01-04 | 2000 | 392.655367 | 3.000000 |
| 623442 | for_sale | 289.0 | 1.0 | 1.0 | 113 Cleveland Ln, Rockaway, NJ, 07866 | Rockaway | New Jersey | 7866.0 | 915.0 | 2000-01-04 | 2000 | 315.846995 | 1.000000 |
| 450571 | for_sale | 365.0 | 2.0 | 1.0 | 49 Tinker Hill Rd, Washington, CT, 06777 | Washington | Connecticut | 6777.0 | 720.0 | 2000-01-04 | 2000 | 506.944444 | 2.000000 |
| 217002 | for_sale | 350.0 | 3.0 | 2.0 | 31 Benefit St, Pawtucket, RI, 02861 | Pawtucket | Rhode Island | 2861.0 | 1500.0 | 2000-01-04 | 2000 | 233.333333 | 1.500000 |
| 654732 | for_sale | 451.0 | 4.0 | 3.0 | 16 Underwood Ct, Burlington, NJ, 08016 | Burlington | New Jersey | 8016.0 | 2085.0 | 2000-01-04 | 2000 | 216.306954 | 1.333333 |
| 612420 | for_sale | 85.0 | 3.0 | 1.0 | 1126 Chestnut St, Wilmington, DE, 19805 | Wilmington | Delaware | 19805.0 | 1075.0 | 2000-01-04 | 2000 | 79.069767 | 3.000000 |
| 586984 | for_sale | 949.9 | 5.0 | 5.0 | 5 Country Ln, Tewksbury Township, NJ, 07830 | Tewksbury Township | New Jersey | 7830.0 | 4508.0 | 2000-01-04 | 2000 | 210.714286 | 1.000000 |
| 459260 | for_sale | 349.0 | 3.0 | 2.0 | 11 Dartmouth Ln, Danbury, CT, 06810 | Danbury | Connecticut | 6810.0 | 1170.0 | 2000-01-05 | 2000 | 298.290598 | 1.500000 |
A usual business cycle can be identified in the time series plot below where the peaks occured at 2005 and 2017 and the downturn at 2000, 2011 and 2022. The downturn occured at 2011 maybe considered weak recoveries from the 2008 global financial crisis and 2022 downturn maybe caused by the covid pandemic and the rise of interest rates.
# Finding the sum of real estate sales Volume in each of the year
General_Sales_Volume = final_real_estate['year'].value_counts().sort_index()
# Plotting the Real Estate Sales Volume over the years
Sales_Volume_Plot = General_Sales_Volume.plot(title = "Total Real Estate Sales Volume from 2000 to 2022",
xlabel = "year", ylabel = "Sales_Volume", kind = "line")
The real estate sales value time series plot below shows a similar pattern as the sales volume time series plot. This may be due to the large value each real estate carries which makes changes in real estate sales very obvious.
# Finding the sum of real estate sales Value in each of the year
General_Sales_Values = pd.DataFrame(final_real_estate.groupby('year')['Sale_Price_Thousands'].sum())
Sales_Revenue_Plot = General_Sales_Values.plot(title = "Total Real Estate Sales Values from 2000 to 2022",
ylabel = "Sales_Price_Thousands", kind = "line", legend = False)
Currently Virgin islands is a state with the highest median property price however, there were only datas on 2 properties listed on Virgin Island and one of the property price was heavily skewed thus creating an outlier. Therefore, its not a good indicator of a median prices in Virgin Islands and thus will be removed from the chart for more accurate analysis.
# find the median house prices in each state
median_prices = final_real_estate.groupby("state")['Sale_Price_Thousands'].median().reset_index().sort_values(by='Sale_Price_Thousands', ascending=False)
median_prices_except_highest = median_prices.iloc[1:]
plt.figure(figsize=(12, 6))
sns.barplot(data=median_prices_except_highest, x='state', y='Sale_Price_Thousands', palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("State")
plt.ylabel("Median Property Price (Thousands $)")
plt.title("Median Property Price by State")
plt.show()
# identifying all the properties from Virgin Islands
virgin_islands_rows = final_real_estate[final_real_estate["state"] == "Virgin Islands"]
display(virgin_islands_rows)
| status | Sale_Price_Thousands | bed | bath | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10344 | for_sale | 950.0 | 5.0 | 4.0 | 46 Water Isle, Saint Thomas, VI, 00802 | Saint Thomas | Virgin Islands | 802.0 | 5000.0 | 2013-10-11 | 2013 | 190.000000 | 1.250000 |
| 10351 | for_sale | 6899.0 | 4.0 | 6.0 | 10 Water Isle, Saint Thomas, VI, 00802 | Saint Thomas | Virgin Islands | 802.0 | 4600.0 | 2018-04-05 | 2018 | 1499.782609 | 0.666667 |
New Jersey has the highest Real Estate sales volume across 22 years while New York is slightly behind.
#2. Identifying the sales volume based on states
# Identifying the number of states available in the US
final_real_estate['state'].nunique()
# Identifying the top 5 states in terms of real estate sales volume
final_real_estate['state'].value_counts().sort_values().tail(5).plot(kind = 'barh',
color = ['Green', 'Blue', 'Orange', 'Brown', 'Purple'])
plt.ylabel("States")
plt.xlabel("Real Estate Sales Volume")
plt.title("Top 5 US States Real Estate Sales Volume")
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Volume')
This line graph shows the annual property sales volume from 2000 to 2022 across the top 5 U.S. states. New Jersey had the highest peak in 2021, while most states showed growth leading up to 2020, followed by a dip in 2022.
# Identifying the real estate sales volume in the top 5 states over the years
# Filter out datas that only contains the top 5 states in terms of volume
top_5_states = final_real_estate[final_real_estate.state.isin(["New Jersey", "Connecticut", "New York", "Pennsylvania", "Massachusetts"])]
top_5_states_sales_volume = pd.DataFrame(top_5_states.groupby(['year', 'state']).size()).rename(columns = {0: "State_Count"})
# Plotting the
Palette = ["Brown", "Green", "Purple", "Orange", "Blue"]
sns.set_palette(Palette)
top_5_states_sales_volume = sns.relplot(x = "year", y = "State_Count",
data = top_5_states_sales_volume, kind = "line", hue = "state")
top_5_states_sales_volume.fig.suptitle("Top 5 States Real Estate Sales Volume from year 2000 to 2022", y = 1)
Text(0.5, 1, 'Top 5 States Real Estate Sales Volume from year 2000 to 2022')
The top 5 US states for sales volume and sales value are the same, however, the New York state has taken the first position having the most real estate sales value for the past 22 years.
#2b Identifying the top 5 states based Sale_Price_Thousandson real estate sales value
final_real_estate.groupby('state')['Sale_Price_Thousands'].sum().sort_values().tail(5).plot(kind = "barh",
color = ['Blue', 'Green', 'Brown', 'Purple', "Orange"])
plt.xlabel("Real Estate Sales Value")
plt.ylabel("States")
plt.title("Top 5 US States Real Estate Sales Value")
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Value')
From the Prices per Sqft analysis, it shows New York is the only state with price per sqaure feet over 500 dollars while other states have quite similar median price per sqft. This would provide some insights of value for money and the prices of properties in each states, however it is important to note that this metric does not consider qualitative factors such as property layout and design and the benefits specific locations bring.
filtered_data = final_real_estate[~final_real_estate['state'].isin(['Puerto Rico', 'Virgin Islands'])]
state_pricespersqft = filtered_data.groupby('state')['price_per_sqft'].median().sort_values()
# generate unique colours for each state using seaborn
colors = sns.color_palette('tab20', len(state_pricespersqft))
state_pricespersqft.plot(kind = 'barh', color = colors)
plt.xlabel("Prices per Sqft")
plt.ylabel("States")
plt.title("Prices per Sqft Across the US States")
Text(0.5, 1.0, 'Prices per Sqft Across the US States')
#3 Identify the Information of the Most expensive real estate in each year.
Most_Expensive_Real_Estate = top_5_states.loc[top_5_states.groupby('year')['Sale_Price_Thousands'].idxmax()]
display(Most_Expensive_Real_Estate)
| status | Sale_Price_Thousands | bed | bath | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 922453 | for_sale | 39000.0 | 8.0 | 13.0 | 555 Lake Ave, Greenwich, CT, 06830 | Greenwich | Connecticut | 6830.0 | 18954.0 | 2000-07-17 | 2000 | 2057.613169 | 0.615385 |
| 505291 | for_sale | 17900.0 | 5.0 | 8.0 | 130 E 71st St, New York City, NY, 10021 | New York City | New York | 10021.0 | 5344.0 | 2001-01-16 | 2001 | 3349.550898 | 0.625000 |
| 906083 | for_sale | 875000.0 | 9.0 | 2.0 | 952 E 223 St Units 4858 & 66, Bronx, NY, 10458 | Bronx | New York | 10458.0 | 2440.0 | 2002-12-30 | 2002 | 358606.557377 | 4.500000 |
| 422397 | for_sale | 60000.0 | 9.0 | 10.0 | 450 Brickyard Rd, Woodstock, CT, 06281 | Woodstock | Connecticut | 6281.0 | 18777.0 | 2003-08-28 | 2003 | 3195.398626 | 0.900000 |
| 716514 | for_sale | 25000.0 | 6.0 | 11.0 | 70 Broad St, New York City, NY, 10004 | New York City | New York | 10004.0 | 19478.0 | 2004-07-12 | 2004 | 1283.499333 | 0.545455 |
| 564996 | for_sale | 18900.0 | 6.0 | 9.0 | 20 E 94th St, New York City, NY, 10128 | New York City | New York | 10128.0 | 8500.0 | 2005-03-24 | 2005 | 2223.529412 | 0.666667 |
| 734908 | for_sale | 58500.0 | 5.0 | 6.0 | 151 E 58th St Ph 50, New York, NY, 10022 | New York | New York | 10022.0 | 9675.0 | 2006-09-08 | 2006 | 6046.511628 | 0.833333 |
| 743986 | for_sale | 20500.0 | 4.0 | 5.0 | New York City, NY, 10023 | New York City | New York | 10023.0 | 3333.0 | 2007-12-20 | 2007 | 6150.615062 | 0.800000 |
| 743290 | for_sale | 50000.0 | 6.0 | 8.0 | 995 5th Ave Unit Ph, New York City, NY, 10028 | New York City | New York | 10028.0 | 6891.0 | 2008-09-12 | 2008 | 7255.840952 | 0.750000 |
| 64850 | for_sale | 15000.0 | 8.0 | 10.0 | 47 White Bridge Rd, Chatham, NY, 12136 | Chatham | New York | 12136.0 | 14058.0 | 2009-10-20 | 2009 | 1067.008109 | 0.800000 |
| 565012 | for_sale | 19995.0 | 6.0 | 8.0 | 326 W 80th St, New York, NY, 10024 | New York | New York | 10024.0 | 8800.0 | 2010-05-24 | 2010 | 2272.159091 | 0.750000 |
| 744145 | for_sale | 25000.0 | 4.0 | 3.0 | 15 Central Park W Apt 27A, New York, NY, 10023 | New York | New York | 10023.0 | 3105.0 | 2011-06-16 | 2011 | 8051.529791 | 1.333333 |
| 781485 | for_sale | 80000.0 | 9.0 | 13.0 | 973 Fifth Ave, New York, NY, 10075 | New York | New York | 10075.0 | 15000.0 | 2012-07-02 | 2012 | 5333.333333 | 0.692308 |
| 743414 | for_sale | 45000.0 | 4.0 | 10.0 | 12 E 82nd St, New York City, NY, 10028 | New York City | New York | 10028.0 | 12000.0 | 2013-02-05 | 2013 | 3750.000000 | 0.400000 |
| 908133 | for_sale | 32000.0 | 6.0 | 12.0 | 16 Hurlingham Dr, Greenwich, CT, 06831 | Greenwich | Connecticut | 6831.0 | 23700.0 | 2014-02-21 | 2014 | 1350.210970 | 0.500000 |
| 836022 | for_sale | 22000.0 | 5.0 | 6.0 | 25 N Moore St Unit 16ABC, New York City, NY, 1... | New York City | New York | 10013.0 | 7020.0 | 2015-10-06 | 2015 | 3133.903134 | 0.833333 |
| 734883 | for_sale | 135000.0 | 5.0 | 6.0 | 432 Park Ave Unit 79, New York, NY, 10022 | New York | New York | 10022.0 | 8055.0 | 2016-06-07 | 2016 | 16759.776536 | 0.833333 |
| 719369 | for_sale | 45000.0 | 4.0 | 5.0 | 30 Park Pl Ph 81, Manhattan, NY, 10007 | Manhattan | New York | 10007.0 | 5443.0 | 2017-02-01 | 2017 | 8267.499541 | 0.800000 |
| 775652 | for_sale | 35000.0 | 3.0 | 5.0 | 432 Park Ave Apt 71B, New York, NY, 10022 | New York | New York | 10022.0 | 4019.0 | 2018-06-08 | 2018 | 8708.633989 | 0.600000 |
| 640427 | for_sale | 29900.0 | 5.0 | 3.0 | 944 Providence Rd, Newtown Square, PA, 19073 | Newtown Square | Pennsylvania | 19073.0 | 2900.0 | 2019-02-07 | 2019 | 10310.344828 | 1.666667 |
| 743002 | for_sale | 34995.0 | 4.0 | 6.0 | 1010 Park Ave Unit Ph, Manhattan, NY, 10028 | Manhattan | New York | 10028.0 | 6745.0 | 2020-09-01 | 2020 | 5188.287620 | 0.666667 |
| 507093 | for_sale | 33000.0 | 5.0 | 6.0 | 180 E 88th St Unit Ph, New York, NY, 10128 | New York | New York | 10128.0 | 5508.0 | 2021-07-12 | 2021 | 5991.285403 | 0.833333 |
| 883664 | for_sale | 35000.0 | 4.0 | 5.0 | 443 Greenwich St Ph G, New York, NY, 10013 | New York | New York | 10013.0 | 5375.0 | 2022-01-10 | 2022 | 6511.627907 | 0.800000 |
Out of 22 years, the New York state sold the most expensive real estate in the US for 18 years, outbeating other states by a huge percentage. Moreover, this may mean that real estates in New York state tends to appreciate more and has higher value due to higher living standards.
#3b At which states are the most expensive real estates sold located in?
plot = Most_Expensive_Real_Estate['state'].value_counts().plot(kind = "bar",
color = ['Orange', 'Brown', 'Blue', 'Purple', "Green"])
plt.xticks(rotation = 0)
plt.xlabel("State")
plt.ylabel("Most Expensive Real Estate Count")
plt.title("Most Expensive Real Estate Count in the Top 5 States")
Text(0.5, 1.0, 'Most Expensive Real Estate Count in the Top 5 States')
From the correlation matrix shown below, turns out number of bathrooms has the strongest positive linear relationship to the sale price compared to other variables. Even thoughg the relationship is still relatively weak. It may be unexpected that the size of the real estate is not the variable with the strongest linear relationship. However, its important to point out that there are other factors that influences real estate prices such as the environment, location choice and the ability of the property to demand higher rents which is influenced by the purpose of using the land.
# Selecting columns for correlation calculation
columns = ['Sale_Price_Thousands', 'bath', 'bed', 'house_size']
subset = top_5_states[columns]
display(subset.corr())
| Sale_Price_Thousands | bath | bed | house_size | |
|---|---|---|---|---|
| Sale_Price_Thousands | 1.000000 | 0.153041 | 0.090015 | 0.118772 |
| bath | 0.153041 | 1.000000 | 0.652077 | 0.603378 |
| bed | 0.090015 | 0.652077 | 1.000000 | 0.494284 |
| house_size | 0.118772 | 0.603378 | 0.494284 | 1.000000 |
The the correlation between House Size and House Price is 0.119. Which shows that House size does not have a strong positive relationship with the housing prices
correlation = final_real_estate[['house_size', 'Sale_Price_Thousands']].corr()
# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='house_size', y='Sale_Price_Thousands', alpha=0.4)
plt.title("House Size vs. Price")
plt.xlabel("House Size (sqft)")
plt.ylabel("Price ($)")
plt.show()
The number of bathrom has the strongest positive relationship with the housing price based on the correlation matrix (0.15) but it still very weak to be considered a positiive relationship, however it is important to note that there are other factors that would contribute to housing prices such as the population density of the state and other aspects of the property itself.
correlation = final_real_estate[['bath', 'Sale_Price_Thousands']].corr()
# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='bath', y='Sale_Price_Thousands', alpha=0.4)
plt.title("Number of Bathrooms vs. Price")
plt.xlabel("Number of Bathrooms")
plt.ylabel("Price ($)")
plt.show()
Based on the table below, a 3 bedroom and 2 bathroom property is the most popular real estate in terms of sales volume every year for the past 22 years. This maybe due to the fact that this is the average family size in the US which drives the most demand for this type of property structure. Moreover, the spike of property sales during 2021 may be because of the covid 19 pandemic which may have affected the financial wellbeing of alot of people, which led to the sale of property. Moreover, the drop in property sales on 2022 may be due to insufficient data collected for the year.
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']
# Generating the index to identify the max counts for each year (idx is in boolean)
idx = raw_top_house.groupby(['year'])['counts'].transform(max) == raw_top_house['counts']
# Will only return rows that matches True
popular_property = raw_top_house[idx]
import plotly.express as px
# Create the line chart
fig = px.line(
popular_property,
x='year',
y='counts',
markers=True,
title= "Sales Trend for the Most demanded property over the years (3 bedroom and 2 bathrooms)",
labels={'year': 'Year', 'counts': 'Number of Listings'}
)
# Customize hover tooltip (optional)
fig.update_traces(hovertemplate='Year: %{x}<br>Listings: %{y}')
# Show the interactive chart
fig.show(renderer = "notebook")
From the results, property with unbalanced bathroom to bedroom ratio property or properties that are much larger in size (bedroom and bathroom numbers) tend to be less popular due to the average household size which makes such properties harder to sell.
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']
# Generating the index to identify the max counts for each year (idx is in boolean)
idx_min = raw_top_house.groupby(['year'])['counts'].transform(min) == raw_top_house['counts']
# Will only return rows that matches True
pd.set_option('display.max_rows', 500)
display(raw_top_house[idx_min].head(10))
| year | bed | bath | counts | |
|---|---|---|---|---|
| 35 | 2000 | 11.0 | 11.0 | 1 |
| 36 | 2000 | 11.0 | 12.0 | 1 |
| 37 | 2000 | 9.0 | 9.0 | 1 |
| 38 | 2000 | 11.0 | 6.0 | 1 |
| 39 | 2000 | 9.0 | 7.0 | 1 |
| 40 | 2000 | 7.0 | 6.0 | 1 |
| 41 | 2000 | 1.0 | 3.0 | 1 |
| 42 | 2000 | 3.0 | 5.0 | 1 |
| 43 | 2000 | 3.0 | 6.0 | 1 |
| 44 | 2000 | 4.0 | 6.0 | 1 |
has_duplicates = final_real_estate.duplicated().any()
# Splitting the dataset by popularity
#1 Count how many times each (bed, bath) combination appears
combo_counts = final_real_estate.groupby(['bed', 'bath']).size().reset_index(name = 'Sales Count (Bed and Bath)')
#2 Merge the counts to the original dataset
final_real_estate_counts = final_real_estate.merge(combo_counts, on = ['bed', 'bath'], how = 'left')
#3 use quantiles to define popularity levels
q_low = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.33)
q_high = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.66)
# Create a new column for popularity
def popularity_label(count):
if count <= q_low:
return 'Least Popular'
elif count <= q_high:
return 'Average Popularity'
else:
return 'Most Popular'
# apply the function to the dataset
final_real_estate_counts['popularity'] = final_real_estate_counts['Sales Count (Bed and Bath)'].apply(popularity_label)
From the below table, it is showned that majority of the property with room ratio of less than 1 or more than 1.5 are categorised as less popular. Generally speaking less popular property do not have a very good balance of bedroom and bathroom, or property with too much bedrooms and bathroom tend to less attractive as it exceed the average people per household in the US and it may be difficult for investors to resell it due to its size and demand for the property.
least_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Least Popular']
display(least_popular_property.groupby(['bed', 'bath', 'room_ratio']).size().reset_index().head(10))
| bed | bath | room_ratio | 0 | |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 0.500000 | 418 |
| 1 | 1.0 | 3.0 | 0.333333 | 18 |
| 2 | 1.0 | 4.0 | 0.250000 | 3 |
| 3 | 1.0 | 10.0 | 0.100000 | 1 |
| 4 | 2.0 | 1.0 | 2.000000 | 2284 |
| 5 | 2.0 | 3.0 | 0.666667 | 894 |
| 6 | 2.0 | 4.0 | 0.500000 | 57 |
| 7 | 2.0 | 5.0 | 0.400000 | 4 |
| 8 | 3.0 | 4.0 | 0.750000 | 650 |
| 9 | 3.0 | 5.0 | 0.600000 | 67 |
From the analysis, a 2-3 bedroom and 2 bathroom property are the most popular because the demand for a property is heavily driven by the usage of the property, on average a US household consist of 3.15 people, thus property with this size is the most popular.
most_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Most Popular']
# identify the unique room_ratio
most_popular_property.groupby(['bed', 'bath','room_ratio']).size().reset_index()
| bed | bath | room_ratio | 0 | |
|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 1.0 | 3545 |
| 1 | 3.0 | 2.0 | 1.5 | 5848 |
The top 5 states for median prices for a 3 bedroom and 2 bathroom property are (New York, New Jersey, Connecticut, Massachusetts, Pennsylvania), we can see that there is minimal variability in the median price of this type of property in the state of New Jersey, Connecticut, Pennsylvania) which may mean it has relatively lower external economic influence and steady housing supply and demand
Moreover the market for the state of New York and Massachusetts seems quite volative, this may mean these states attracts alot of investment and migration during economic booms, thus strongly follows the economic performance of the country. This may be good for short term investors that profits from short term flipping and increase in rental yields.
# Filter out the datas for 3 bedroom and 2 bathrooms
Threebathroom_Twobathroom_Properties = final_real_estate[(final_real_estate['bed'] == 3) & (final_real_estate['bath'] == 2)]
# Group by state and calculate the median prices of the properties (inc all states)
median_prices = Threebathroom_Twobathroom_Properties.groupby(['state','year'])['Sale_Price_Thousands'].median().reset_index()
# Identifying the top 5 states
top_states = Threebathroom_Twobathroom_Properties['state'].value_counts().nlargest(5).index
# identifying the median prices for those top 5 states
median_prices_top = median_prices[median_prices['state'].isin(top_states)]
# Plotting the graph
plt.figure(figsize=(14, 8))
sns.lineplot(data=median_prices_top, x='year', y='Sale_Price_Thousands', hue='state', marker='o')
plt.title('Median Price of 3 Bed / 2 Bath Properties by State Over Time')
plt.xlabel('Year')
plt.ylabel('Median Price')
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
The top 5 States for real estate sales in the US are New Jersey, Connecticut, New York, Pennsylvania and Massachusetts for the past 22 years. If this trend continues, the real estates in these states may be potential investment options as the market is quite active, thus more liquid. However, real estates in these states may be potentially more expensive, but at the same time these states may have more employment opportunities. If quality of life such as pollution rates are of concern, other less crowded states maybe better choices for living.
Moreover, real estate values are mainly driven by value generating drivers such as location and the property's ability to generate rent rather than just real estate size. It is important to note that the balance between bathroom and bedroom is very important indicating the liquidity of a property, demands for property is mainly driven by needs of the buyer. If the buying motive is for investment purposes (potential future sale or rent generation), it is important to look into what the target market is demanding.
Furthermore, a 3 bedroom and 2 bathroom real estate is the most demanded real estate for the past 22 years and if this trend continues to the future, this is mainly driven by the average household size in the US. This type of real estate would be a safe and great investment option. It is important to note that once the investment or purchase motive is finalized, finding the right combination of location and other factors is very important for both investors and home buyers.